Introduction to Open Data Science - Course Project

1. About the project

This is an R Markdown file used for the “Introduction to Open Data Science” (IODS) course at the University of Helsinki in 2023. The following lines are used to practise the R Markdown syntax.

not-so-random thoughts

  • I am feeling quite tired at the moment; it’s Friday afternoon.
  • In this course, I expect to learn a few things:
    • using tidyverse as a modern tool for R (rather than base R)
    • learning how to use GitHub, since I have no experience with it
    • learning how to version-control RStudio projects
  • I saw the course in SISU and in my PhD programme (ATM-DP).

Here is a link to my GitHub repository: https://github.com/ntriches/IODS-project

# This is a so-called "R chunk" where you can write R code. 
# I don't have any code yet, so I will leave this chunk the way it is.

date()
## [1] "Mon Dec 11 16:37:13 2023"

Below, I describe the work and results of the first week, a.k.a. the “warm up phase” of the IODS 2023 project.

Reflection on “Start me up!”

  • I had to re-do the whole project twice to get my course diary live - I have no idea what the mistake was
  • I did not manage to commit and push with the Personal Access Token (PAT). The error was: remote: Permission to ntriches/IODS-project.git denied to ntriches. fatal: unable to access ‘https://github.com/ntriches/IODS-project.git/’: The requested URL returned error: 403
  • I tried to solve it according to a Stack Overflow answer, but that didn’t work either
  • In the end, I uploaded an SSH key. With this, I can commit and push using the terminal, which works very well.

Reflection on learning experiences from “R for Health Data Science”

… from “R for Health Data Science”, chapters 1-4, and exercise set 1.

  • “R for Health Data Science” is a really nice book! I wish I had known about it before
  • If I didn’t know anything about R / RStudio, I think I’d be quite overwhelmed, simply because there is so much information
  • My favourite topics were “2.11 joining multiple datasets” and “3.5 summarise() vs mutate()” and “4.9 multiple geoms, multiple aes()”

2. Regression and model validation

This file describes the work and results of the second week, a.k.a. “Regression and model validation”, of the IODS2023 course.

2.1 Information on data set

The data set is based on a survey conducted in Finland between 3.12.2014 and 10.1.2015. The aim of the survey was to examine the relationship between learning approaches and students’ achievement in an introductory statistics course in Finland (click here to see more information on the data set and the course). The data set has 166 observations / rows with 7 variables / columns, of which four are numerical, two are integer and one is a categorical character variable (see output below for details). Note that only a few of the originally recorded variables are included in this data set.

# message = FALSE will not show any message from R

# Read and summarise file 
# load saved file from data wrangling exercise 
learning2014 <- read.table(file='/home/ntriches/github_iods2023/IODS23/data/learning2014.csv', header=TRUE, sep = ",")
str(learning2014) # structure of data
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014) # dimension of data
## [1] 166   7

The following figure shows a plot matrix of the 7 variables in the dataset, where each variable is plotted against every other variable. Pink shows data from female students, whereas blue shows data from male students of the survey. The scatter plots, distributions and correlations are therefore split into female and male students. The distributions show that the majority of the students were very young (< 25 years old), leading to a right-skewed distribution of age. All other variables are relatively normally distributed, with some tendency towards a left-skewed distribution for points and deep, and for the male attitude values. This is also well visible in the outliers of the box plots in the top row of the matrix.

We can see that a better attitude towards the course, represented by a higher number, seems to lead to significantly higher points, in other words, to better results in the exam (p < 0.001). It also appears that older male students might get fewer points in the exam (p < 0.1). The strategic learning approach (stra) might lead to better exam results, shown by a positive correlation (p < 0.1). On the other hand, we might assume that deep (deep) and surface learning (surf) are not successful learning approaches, as they show negative correlations with points (p < 0.1).

# fig.width and fig.height enlarge the figure below

# Graphical overview 
# load libraries 
library(ggplot2)
library(GGally)
library(dplyr)   # needed for the pipe operator %>%

# create plot matrix 
overview_plot <- learning2014 %>%
  ggpairs(mapping = aes(col = gender, alpha = 0.3),
          lower = list(combo = wrap("facethist", bins = 20))) +
  theme_grey(base_size = 20)
overview_plot

2.2 Regression model

For the multivariable regression model, I selected the three explanatory (predictor) variables age, attitude, and strategic learning (stra). The dependent target variable is points. For my model, I set the significance level at p = 0.05.

The model summary shown below first gives a reminder of how the model was fit in the function call, then a summary of the distribution of the residuals, the results (“Coefficients”), and an indication of the general model quality (“Residual standard error”, “R-squared”, “F-statistic”). In the table of the model coefficients, we can see the estimated value of each coefficient with its estimated standard error, and the corresponding t-statistic and p-value. In the last part of the output, we can see an estimate of the residual standard error with the corresponding degrees of freedom (166 obs. - 4 estimated coefficients = 162). The multiple R-squared tells us how much of the variance of points is explained by the model; the adjusted R-squared additionally takes into account the number of variables included in the model. The final line shows the result of the F-test of the hypothesis that all coefficients except the intercept are equal to zero.
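As a sanity check on the summary output, the adjusted R-squared can be computed from the multiple R-squared, the number of observations (n = 166) and the number of predictors (p = 3):

\[R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1} = 1 - (1 - 0.2182)\cdot\frac{165}{162} \approx 0.204,\]

which matches the adjusted R-squared reported in the model summary.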

In my model, attitude shows a positive relationship with points (p < 0.001). In other words, a positive attitude towards the course led to higher exam results: with an increase of 1 in attitude, the points in the exam are estimated to increase by 3.48. On the other hand, age and stra did not show any significant relationship with the exam results (p > 0.05). There are trends suggesting that exam points decrease very slightly with age and increase with the strategic learning approach, but neither is significant according to my definition. Overall, the explanatory variables I chose only explain 21.8% of the variation in the data, as shown by the multiple R-squared of 0.2182. The fitted model has a residual standard error of 5.26 points.

If I remove age and stra from my model, attitude remains highly significant (p < 0.001) but the multiple R-squared decreases slightly to 19%. If I run the model with attitude plus stra or age separately, the multiple R-squared increases to 20.48% and 20.11%, respectively, but no explanatory variable other than attitude reaches significance (the p-values of stra and age stay above 0.05 in every model). Overall, the first model with three explanatory variables seemed to explain the data best, and shows that attitude has a significant influence on the points in the exam.

# Regression model 

# create a regression model with three explanatory variables (stra, age, attitude)
lm_model_points_stra_age_attitude <- lm(points ~ stra + age + attitude, data = learning2014)
# show summary of the fitted model 
summary(lm_model_points_stra_age_attitude)
## 
## Call:
## lm(formula = points ~ stra + age + attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08
# Does the model (R-squared) improve with fewer explanatory variables? 
# create a simple linear regression with attitude 
lm_model_points_attitude <- lm(points ~ attitude, data = learning2014)
summary(lm_model_points_attitude) # no
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
# create a regression model with two explanatory variables (stra, attitude)
lm_model_points_stra_attitude <- lm(points ~ stra + attitude, data = learning2014)
summary(lm_model_points_stra_attitude)
## 
## Call:
## lm(formula = points ~ stra + attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
# create a regression model with two explanatory variables (age, attitude)
lm_model_points_age_attitude <- lm(points ~ age + attitude, data = learning2014)
summary(lm_model_points_age_attitude)
## 
## Call:
## lm(formula = points ~ age + attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3354  -3.3095   0.2625   4.0005  10.4911 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.57244    2.24943   6.034 1.04e-08 ***
## age         -0.07813    0.05315  -1.470    0.144    
## attitude     3.54392    0.56553   6.267 3.17e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.301 on 163 degrees of freedom
## Multiple R-squared:  0.2011, Adjusted R-squared:  0.1913 
## F-statistic: 20.52 on 2 and 163 DF,  p-value: 1.125e-08

2.3 Assumptions of the model and validity interpretation

The assumptions for linear regression are as follows:

  1. Linearity, i. e., there is a linear relation between the explanatory and target variables.
  2. Homoscedasticity, i. e., the variance of the errors is constant across the range of the explanatory variables.
  3. Normality of the error terms, i. e., the error terms follow a normal distribution with mean zero.

We assess these assumptions by looking at the residuals, i. e., the differences between the observed and the fitted values. In the top left plot of the following figure (“Residuals vs Fitted”), the residuals scatter fairly evenly around the horizontal line at y = 0 without a clear pattern, so the linearity assumption is fulfilled. The spread around that line is also roughly constant across the fitted values, so the assumption of homoscedasticity is fulfilled, too. The normality of the error terms can be checked in the top right plot (“Normal Q-Q”). There are no major deviations from the diagonal line, so this assumption is also fulfilled. In the bottom left plot (“Residuals vs Leverage”), we get an estimate of how much influence each observation had on the fitting of the regression model.

# Plot Residuals vs Fitted values, Normal QQ-plot, and Residuals vs Leverage 

par(mfrow = c(2,2))
plot(lm_model_points_stra_age_attitude,
     which = c(1,2,5))
par(mfrow = c(1,1))


3. Logistic regression

This file describes the work and results of the third week, a.k.a. “Logistic regression”, of the IODS2023 course.

3.1 Information on data set

The original data set includes data on student performance in the subjects “Mathematics” and “Portuguese language” in secondary education (high school) at two Portuguese schools. It was collected through school reports and questionnaires, and gives information on the demographic and social background of the students, as well as the students’ grades and data on the school they attend (click here to see more information on the data set and access the data). One of the collected variables concerns the alcohol consumption of the students.

The data set we use has 370 observations / rows with 35 variables / columns; most are numerical integers and binary characters, but there are also some nominal characters with more than two levels. It combines the data from the two student alcohol consumption data sets. ‘alc_use’ gives the average of the workday and weekend alcohol consumption (from 1 - very low to 5 - very high), and the logical variable ‘high_use’ (TRUE / FALSE) indicates whether a student consumes more than a moderate amount of alcohol (alc_use > 2).

# Read and summarise file 
# load saved file from data wrangling exercise 
alc <- read.table(file='/home/ntriches/github_iods2023/IODS23/data/alc.csv', header=TRUE, sep = ",")
colnames(alc) # names of variables 
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc)      # dimension of data
## [1] 370  35

3.2 Analysis

3.2.1 Aim

The aim of my analysis is to study the relationships between high / low alcohol consumption and the students’ gender (“sex”, female / male), how often they go out with friends (“goout”, 1 - very low to 5 - very high), how often they miss school (“absences”, number of school absences), and how well they performed in their final grade (“G3”, 0 to 20). My personal hypotheses are as follows:

  1. Students identifying as female drink less (relatively) than students identifying as male.
  2. If students go out with friends often, they drink more (= a higher number in goout is associated with a high alcohol consumption).
  3. The more classes students miss, the more likely they are to drink (= the higher the number of school absences, the more likely a high alcohol consumption).
  4. The more students drink, the lower their final grade is (= students with a high alcohol consumption have lower results in their final grade).

3.2.2 Numerical and graphical exploration

The numerical exploration of our data shows that most students fall into the category of low alcohol consumption. Females and males are almost equally represented (195 and 175, respectively). For goout and absences, most students are clustered around the categories 3 and 4 (see the medians); it is only in the means that we can see some differences.

library(dplyr)  # load needed library

# summary statistics by group
# differences between female / male and high / low alcohol consumption, mean of going out, mean of absences, mean of final grade
alc %>% 
  group_by(sex, high_use) %>% 
  summarise(count = n(), 
            mean_goout = mean(goout),
            median_goout = median(goout),
            mean_absences = mean(absences),
            median_absences = median(absences),
            mean_grade = mean(G3),
            median_grade = median(G3)
            )
## # A tibble: 4 × 9
## # Groups:   sex [2]
##   sex   high_use count mean_goout median_goout mean_absences median_absences
##   <chr> <lgl>    <int>      <dbl>        <dbl>         <dbl>           <dbl>
## 1 F     FALSE      154       2.95            3          4.25               3
## 2 F     TRUE        41       3.39            4          6.85               3
## 3 M     FALSE      105       2.70            3          2.91               3
## 4 M     TRUE        70       3.93            4          6.1                4
## # ℹ 2 more variables: mean_grade <dbl>, median_grade <dbl>

The clustering of the medians is also evident when looking at the box plots (see below). In goout, the upper quartile equals the median for males with high use = FALSE, and for females with high use = TRUE. absences shows a relatively high number of outliers but generally higher values for high use = TRUE. In the grades (G3), we can immediately see that there are clear differences between females and males.

library(dplyr)
library(ggplot2) # load needed libraries 
# install.packages("patchwork") to show boxplots next to each other
library(patchwork)

# box plot showing differences between sex concerning high consumption of alcohol and going out
plot_alc_use_goout_by_sex <- alc %>%
  ggplot(aes(x = high_use, y = goout, col = sex)) +
  geom_boxplot()
# box plot showing differences between sex concerning high consumption of alcohol and absences in school
plot_alc_use_absences_by_sex <- alc %>%
  ggplot(aes(x = high_use, y = absences, col = sex)) +
  geom_boxplot()
# box plot showing differences between sex concerning high consumption of alcohol and final grades
plot_alc_use_grades_by_sex <- alc %>%
  ggplot(aes(x = high_use, y = G3, col = sex)) +
  geom_boxplot()
# show and combine all plots 
plot_alc_use_goout_by_sex + plot_alc_use_absences_by_sex + plot_alc_use_grades_by_sex + plot_layout(guides = 'collect')

Concerning my hypotheses, this means the following:

  1. 40% of the males show a high alcohol consumption (70 TRUE / 175 total * 100), compared to 21% of the females (41 TRUE / 195 total * 100). I can therefore assume that my first hypothesis, that males drink more than females, can be accepted.
  2. Females who have a high alcohol consumption go out more often (3.4) than those with a low consumption (3). For males, the difference is even greater: those going out often (3.9) drink more than those who go out less (2.7). I can therefore assume that my second hypothesis, that students who go out often drink more, can be accepted.
  3. Females seem to miss school more often than males, but for both genders the same trend is evident: students with a high use of alcohol have a higher number of absences. I can assume that my third hypothesis can be accepted.
  4. Males with a high alcohol consumption achieve lower grades in their final exam than those with a low alcohol consumption (10.3 and 12.3, respectively). For females, the differences in the final grades are smaller and show the opposite trend: with high alcohol consumption, the final grades are slightly higher than with low alcohol consumption (11.8 and 11.4, respectively). My fourth hypothesis, that high alcohol consumption results in lower final grades, can therefore only be partially accepted.
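The shares quoted in point 1 can be recomputed directly from the group counts in the summary table above. A minimal sketch (the helper `high_use_share` is my own, not part of the course material):

```r
# share of students with a high alcohol use, from the group counts above
# (high_use_share is a hypothetical helper added for illustration)
high_use_share <- function(n_high, n_total) {
  round(100 * n_high / n_total, 1)
}

high_use_share(70, 175)   # males: 40 %
high_use_share(41, 195)   # females: 21 %
```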

3.2.3 Logistic regression model

We use a logistic regression to statistically explore the relationship between the binary high / low alcohol consumption target variable and the explanatory variables gender (sex), how often students go out (goout), their absences from school (absences), and their final grades (G3).

# find the model with glm()
model_high_use_sex_goout_absences_grades <- glm(high_use ~ sex + goout + absences + G3, data = alc, family = "binomial")

# print out a summary of the model
summary(model_high_use_sex_goout_absences_grades) # G3 (grades) not statistically significant 
## 
## Call:
## glm(formula = high_use ~ sex + goout + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9500  -0.7990  -0.5263   0.8110   2.4703  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.58037    0.70357  -5.089 3.60e-07 ***
## sexM         1.01859    0.25986   3.920 8.87e-05 ***
## goout        0.70551    0.12167   5.799 6.68e-09 ***
## absences     0.08263    0.02267   3.645 0.000268 ***
## G3          -0.04508    0.03985  -1.131 0.258029    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 372.82  on 365  degrees of freedom
## AIC: 382.82
## 
## Number of Fisher Scoring iterations: 4

We can see that absences, goout, and sexM show highly significant positive associations with a high alcohol consumption (p < 0.001), whereas the final grades (G3) do not show a significant relationship with alcohol consumption. sexM indicates that males are more likely than females to drink a lot of alcohol. In other words, male students, students who miss a lot of classes, and students who go out often drink more alcohol.

# compute odds ratios (OR)
OR <- coef(model_high_use_sex_goout_absences_grades) %>% 
  exp 

# compute confidence intervals (CI)
CI <- confint(model_high_use_sex_goout_absences_grades) %>% 
  exp

# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR       2.5 %    97.5 %
## (Intercept) 0.02786536 0.006654554 0.1057209
## sexM        2.76927745 1.674593890 4.6488214
## goout       2.02488717 1.604625853 2.5882364
## absences    1.08614409 1.040269147 1.1384190
## G3          0.95592337 0.883818334 1.0338017

The odds ratios show that the odds of a high alcohol consumption are 2.8 times higher for males than for females, with a 95 % confidence interval ranging from 1.7 to 4.6. Similarly, each one-unit increase in goout multiplies the odds of a high alcohol consumption by 2.0, with a confidence interval ranging between 1.6 and 2.6.
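For reference, each odds ratio is just the exponentiated log-odds coefficient from the model summary; a quick sketch for sexM (the Wald interval shown here is an approximation, so it differs slightly from the profile-likelihood interval that confint() reports):

```r
# exponentiating the sexM log-odds coefficient reproduces its odds ratio
coef_sexM <- 1.01859   # estimate from the model summary above
se_sexM   <- 0.25986   # its standard error

exp(coef_sexM)                             # ~2.77, as in the OR column
# approximate Wald 95 % confidence interval on the odds-ratio scale
exp(coef_sexM + c(-1.96, 1.96) * se_sexM)  # ~(1.66, 4.61)
```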

3.2.4 Predictions

With our logistic regression model, it is possible to make predictions and to explore how well the model actually predicts the target variable, high alcohol consumption. We can do this by producing a 2x2 cross tabulation of predictions versus the actual values, and a graphical visualisation of predictions and actual values (see below). To simplify our model, we remove the explanatory variable that did not have a significant relationship with alcohol consumption (G3).

# run new model without grades (G3) because they had no significant influence on high_use
model_high_use_sex_goout_absences <- glm(high_use ~ sex + goout + absences, data = alc, family = "binomial")
# print out summary of model
summary(model_high_use_sex_goout_absences)
## 
## Call:
## glm(formula = high_use ~ sex + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8060  -0.8090  -0.5248   0.8214   2.4806  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.18142    0.48085  -8.696  < 2e-16 ***
## sexM         1.02223    0.25946   3.940 8.15e-05 ***
## goout        0.72793    0.12057   6.038 1.56e-09 ***
## absences     0.08478    0.02266   3.741 0.000183 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 374.10  on 366  degrees of freedom
## AIC: 382.1
## 
## Number of Fisher Scoring iterations: 4
# predict() the probability of high_use and add to alc data frame 
alc <- mutate(alc, probability = predict(model_high_use_sex_goout_absences, type = "response"))
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions: 2x2 cross tabulation
table(high_use = alc$high_use, prediction = alc$prediction) %>%
  addmargins()
##         prediction
## high_use FALSE TRUE Sum
##    FALSE   242   17 259
##    TRUE     61   50 111
##    Sum     303   67 370
# tabulate the target variable versus the predictions in percentages
table(high_use = alc$high_use, prediction = alc$prediction) %>%
  prop.table() %>%
  addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65405405 0.04594595 0.70000000
##    TRUE  0.16486486 0.13513514 0.30000000
##    Sum   0.81891892 0.18108108 1.00000000
# plot 'high_use' versus 'probability' 
plot_probability_high_use_prediction <- alc %>%
  ggplot(aes(x = probability, y = high_use, col = prediction)) +
  geom_point()
plot_probability_high_use_prediction

Our model predicts the low-consumption category well: for students who weren’t classified as high consumers (high use = FALSE), the model was correct for 242 out of 259 students, corresponding to 65% of all observations. For students classified as drinking a lot of alcohol (high use = TRUE), however, it was correct for only 50 out of 111 students, corresponding to 13.5% of all observations. This is also visible in the figure: in the upper line, showing observations for students with a high alcohol consumption (TRUE), most observations can be found on the right side of the plot (probability > 0.5, prediction = TRUE in blue). Opposite to this, the lower line shows that for students with a low alcohol consumption (FALSE), most observations lie on the left side (probability < 0.5, prediction = FALSE in red).
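The same cross tabulation can be condensed into a few standard classification metrics. A small sketch with the counts hard-coded from the table above (the metric names are my addition, not part of the exercise set):

```r
# classification metrics from the 2x2 cross tabulation above (counts hard-coded)
tn <- 242; fp <- 17   # actual FALSE: predicted FALSE / TRUE
fn <- 61;  tp <- 50   # actual TRUE:  predicted FALSE / TRUE
n  <- tn + fp + fn + tp                 # 370 observations in total

accuracy    <- (tn + tp) / n            # share of all correct predictions
sensitivity <- tp / (tp + fn)           # high users correctly detected
specificity <- tn / (tn + fp)           # low users correctly detected
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

The high specificity and low sensitivity match the impression above: the model identifies low-use students much more reliably than high-use students.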

It is also possible to compute the total proportion of inaccurately classified individuals (= the training error). The loss function below returns the share of misclassified observations: called with prob = 0, i. e. predicting low use for everyone, it gives an error of 30%, and called with prob = 1, i. e. predicting high use for everyone, an error of 70%. The fitted model does better than both of these naive baselines: from the cross tabulation above, it misclassifies (61 + 17) / 370 ≈ 21% of the observations.

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func with prob = 0 (predicting FALSE for everyone): the error equals the share of high users
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
# call loss_func with prob = 1 (predicting TRUE for everyone): the error equals the share of low users
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7

3.2.5 Cross-validation

Cross-validation is a method we can use to get a good estimate of the actual predictive power of the model. We can also use it to compare different models. As we can see below, my model shows an error of 0.21 with a 10-fold cross-validation. This is lower than the error of the model in the Exercise set (0.26).

library(boot) 

# 10-fold cross-validation
ten_fold_cross_validation <- cv.glm(data = alc, cost = loss_func, glmfit = model_high_use_sex_goout_absences, K = 10)
# average number of wrong predictions in the cross validation
ten_fold_cross_validation$delta[1]
## [1] 0.2135135

4. Clustering and classification

This file describes the work and results of the fourth week, a.k.a. “Clustering and classification”, of the IODS2023 course.

4.1 Information on data set

The Boston data set from the MASS package in R consists of information on different characteristics for suburbs in Boston, Massachusetts, US. Variables include, amongst others:

  • crim: per capita crime rate by town,
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft. (2322.576 sq. meter),
  • nox: nitrogen oxides concentration (parts per 10 million),
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise),
  • rad: index of accessibility to radial highways.

The data set contains 506 rows (observations) and 14 columns (variables), all of which are numerical and none of which are characters; chas is a binary integer, and rad an integer index. More information on the data set, its variables and the abbreviations can be found here.

# access all packages needed in this chunk
library(MASS)
library(dplyr)
library(tidyverse)

# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Graphical and numerical overview

The black-and-white plot matrix below shows that several variables split into two groups, e. g., chas (binary), rad, and tax. crim and zn indicate that many observations are 0 (i. e., no crimes and no large residential lots, respectively). Most of the variables are not normally distributed: the proportion of owner-occupied units built prior to 1940 (age) is high in most suburbs, as are the values of black (a transformed proportion of black residents).

# access all packages needed in this chunk
library(dplyr)
library(tidyverse)

# show summaries of variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# plot matrix of the variables
pairs(Boston)

# histograms of all variables (pivot_longer() supersedes gather())
Boston %>% 
     pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
     ggplot(aes(x = value)) + geom_histogram(binwidth = 1) + facet_wrap("key", scales = "free")

When we look at the coloured correlation matrix below, we can see the correlations between the variables more clearly. Big circles indicate a strong correlation, whereas small circles indicate a weak or no correlation, also noticeable by their faint colour. The blue and red circles indicate a positive and negative correlation, respectively. A very strong positive correlation can, e. g., be seen between rad and tax. A strong negative correlation can, e. g., be seen between age and dis (weighted mean of distances to five Boston employment centers).

# access all packages needed in this chunk
library(corrplot)
library(dplyr)

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>%
  round(digits = 2)
# visualise the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

4.2 Scale the data set and divide it

To perform a linear discriminant analysis, it is necessary to scale the data. For this, we subtract the column means from the corresponding columns and divide the difference by the standard deviation:

\[scaled(x) = \frac{x - mean(x)}{ sd(x)}\]
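A quick sanity check in base R (on a toy vector, not the Boston data) confirms that scale() applies exactly this formula:

```r
# toy vector to verify that scale() matches (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8, 10)
manual <- (x - mean(x)) / sd(x)
built_in <- as.numeric(scale(x))   # scale() returns a matrix; flatten it
all.equal(manual, built_in)        # TRUE
```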

When we look at the scaled data, we can see in the summary that the mean of every variable equals 0. Similarly, the standard deviation is 1 for all variables (checked below for zn and age only).

# center and standardize variables
boston_scaled <- Boston %>%
  scale()
# change the object to a data frame (the columns are already numeric)
boston_scaled <- as.data.frame(boston_scaled)
# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
sd(boston_scaled$zn)
## [1] 1
sd(boston_scaled$age)
## [1] 1
# create a quantile vector of crim 
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical variable to the scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# divide data set in test and train
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

In order to predict what might happen in Boston’s suburbs, we need to know how well our model works. For this, we split the data set into a train set (80% of the data) and a test set (20% of the data). We then train the model on the train set and evaluate its predictions on the test set.
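One caveat: sample() draws a different split on every run, so the split above is not reproducible as written. A minimal sketch of a seeded split on a toy data frame (the seed value is arbitrary):

```r
set.seed(2023)                    # arbitrary seed; fixes the random split
toy <- data.frame(id = 1:100, y = rnorm(100))
n <- nrow(toy)
ind <- sample(n, size = n * 0.8)  # indices of 80% of the rows
train <- toy[ind, ]
test  <- toy[-ind, ]
c(nrow(train), nrow(test))        # 80 and 20 rows
```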

4.3 Linear discriminant analysis

Linear discriminant analysis is a statistical method that finds linear combinations of the explanatory variables (predictors) that separate the target classes as well as possible. It weights the predictors and combines them into so-called linear discriminant functions (LD1, LD2, LD3, see below) along which the differences between the groups are as large as possible.

From the summary below, we can see that based on the training data, 24% of the observations belong to the low group, 25% to med_low, 25% to med_high and 27% to high, respectively (“Prior probabilities of groups”). The proportion of trace shows how the between-class variance is distributed over the linear discriminant functions: 95.5% of it is captured by the first linear discriminant function (LD1). The coefficients of linear discriminants indicate that rad (index of accessibility to radial highways) dominates LD1 (3.39) compared to all other variables, whose coefficients range around 0.

# crime = target variable, . = all other (explanatory) variables
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2351485 0.2475248 0.2450495 0.2722772 
## 
## Group means:
##                  zn      indus         chas        nox         rm        age
## low       0.9290593 -0.8694412 -0.106556426 -0.8452771  0.4257276 -0.8243451
## med_low  -0.1182824 -0.2789913  0.003267949 -0.5879136 -0.1167839 -0.4048797
## med_high -0.3737674  0.1435402  0.125357825  0.3677850  0.0794748  0.4328663
## high     -0.4872402  1.0169558 -0.093369966  1.0498174 -0.4015220  0.8029263
##                 dis        rad        tax     ptratio      black       lstat
## low       0.7953199 -0.6917322 -0.7596346 -0.48463936  0.3838651 -0.75387732
## med_low   0.3700278 -0.5466022 -0.4918629 -0.07784658  0.3628689 -0.17548697
## med_high -0.3668996 -0.4412797 -0.3527969 -0.27480023  0.1348722 -0.05706542
## high     -0.8423269  1.6397657  1.5152267  0.78268316 -0.7778077  0.87403996
##                 medv
## low       0.52920937
## med_low   0.01382975
## med_high  0.13185109
## high     -0.68302992
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.07546570  0.86332960 -1.18275029
## indus    0.05490067 -0.08933777  0.15560210
## chas    -0.12919427 -0.02855052  0.12402187
## nox      0.30858020 -0.65856823 -1.21637015
## rm      -0.10446925 -0.08814759 -0.03363678
## age      0.24965607 -0.38786012 -0.33224179
## dis     -0.09088826 -0.24029972  0.33830435
## rad      3.39076171  1.01745769 -0.23715643
## tax      0.03018729 -0.24031926  0.88407995
## ptratio  0.09381865  0.13706298 -0.34424564
## black   -0.09635558  0.00261990  0.16256367
## lstat    0.20138019 -0.13082997  0.50793898
## medv     0.15294946 -0.21126281 -0.13786459
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9551 0.0312 0.0138
# load function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

This is confirmed by the plot, where rad seems to be the only variable strongly influencing LD1. We can also see that the groups of observations separate mainly along LD1 (x-axis), with the high group clustered at one end. LD2 (y-axis) shows little discriminative power, i. e., it does not capture the differences between the groups well.

4.4 Predictions

After training the model, we can now predict classes with the LDA model on the test data. Looking at the class-wise accuracy, the predictions for high are the most accurate, followed by med_high, low, and med_low (100%, 67%, 53%, and 50%, respectively). This is also evident in the cross-tabulation: all high observations were predicted correctly, whereas, e. g., half of the med_low observations were confused with the neighbouring classes.

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
conf <- table(correct = correct_classes, predicted = lda.pred$class)
conf
##           predicted
## correct    low med_low med_high high
##   low       17      14        1    0
##   med_low    5      13        8    0
##   med_high   0       7       18    2
##   high       0       0        0   17
# class-wise accuracy: correct predictions divided by the number of observations in each class
diag(conf) / rowSums(conf)
##       low   med_low  med_high      high 
## 0.5312500 0.5000000 0.6666667 1.0000000
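The class-wise rates can be complemented with the overall accuracy, computed from the same cross-tabulation. A small base-R sketch that re-enters the printed counts:

```r
# confusion matrix as printed above (correct classes in rows)
conf <- matrix(c(17, 14,  1,  0,
                  5, 13,  8,  0,
                  0,  7, 18,  2,
                  0,  0,  0, 17),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
# overall accuracy: correctly classified observations over all observations
sum(diag(conf)) / sum(conf)  # 65/102, about 0.64
```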

4.5 Distance measures and k-means clustering

To state whether objects are similar to one another, we can also measure distances. The most common distance measure is the Euclidean distance, the length of the straight line connecting two points in the coordinate space.
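In base R, this definition can be checked directly against dist() (toy points only):

```r
# two points in the plane
p <- c(0, 0)
q <- c(3, 4)
# Euclidean distance from the definition: square root of summed squared differences
manual <- sqrt(sum((p - q)^2))            # sqrt(9 + 16) = 5
# the same distance computed by dist()
built_in <- as.numeric(dist(rbind(p, q)))
c(manual, built_in)                       # both equal 5
```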

K-means clustering is a commonly used clustering method to assign observations to groups (a. k. a. clusters) based on how similar they are.
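As a minimal illustration before applying it to the Boston data, kmeans() on two well-separated toy groups (the seed and group means are arbitrary choices):

```r
set.seed(42)  # arbitrary seed for reproducibility
# 10 two-dimensional points around (0, 0) and 10 around (5, 5)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
# assign each observation to the nearest of 2 cluster centres
km <- kmeans(toy, centers = 2)
table(km$cluster)  # sizes of the two clusters
```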

# reload Boston data set
library(MASS)
data("Boston")
# standardise the data set (scale() already returns numeric columns)
boston_scaled <- as.data.frame(scale(Boston))

# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# k-means clustering with 2 clusters
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

# k-means clustering with 3 clusters
km <- kmeans(boston_scaled, centers = 3)
pairs(boston_scaled, col = km$cluster)

# k-means clustering with 4 clusters
km <- kmeans(boston_scaled, centers = 4)
pairs(boston_scaled, col = km$cluster)

# k-means clustering with 5 clusters
km <- kmeans(boston_scaled, centers = 5)
pairs(boston_scaled, col = km$cluster)

From the different plots above, 4 clusters seem to represent the differences in the explanatory variables relatively well. Rather than manually trying to find the optimal k for the k-means, however, we can determine the number of clusters by looking at how the total within-cluster sum of squares (TWCSS) changes with k. The optimal number of clusters is where the TWCSS drops sharply, which here is around 2 (see below).

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results (qplot() is deprecated since ggplot2 3.4.0; use ggplot directly)
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) +
  geom_line()


5. Dimensionality reduction techniques

This file describes the work and results of the fifth week a. k. a. “Dimensionality reduction techniques” of the IODS 2023 course.

5.1 Graphical and numerical overview of data

The ‘human’ data set comes from the United Nations Development Programme and uses the HDI (Human Development Index) to assess the development of a country according to citizen attributes, not only economic growth (more information here: https://hdr.undp.org/data-center/human-development-index#/indicies/HDI).

The data shows that life expectancy at birth (Life.Exp) varies a lot, with a minimum of 49 years, a median of 74 years, and a maximum of 83.5 years (see summary below). The ratio of women to men with at least secondary education (Edu2.FM) is close to 1 for most countries (median 0.94). This fits with the expected years of education (Edu.Exp) being around 13 years. The ratio of female to male labour force participation (Labo.FM) has a median of 0.75. The representation of women in parliament (Parli.F) varies from none (min = 0%) to a maximum of 57.5%; on average, every fifth member of parliament is female.

# access all packages needed in this chunk
library(dplyr)
library(readr)
library(tibble)
library(GGally)
library(corrplot)

# load in data
human <- read.csv(file='/home/ntriches/github_iods2023/IODS23/data/human.csv', header=TRUE)

# move the country names to rownames
human_country_as_row <- column_to_rownames(human, "Country")

# summary
summary(human_country_as_row)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# visualize the 'human_' variables
plot_corr_humans <- human_country_as_row %>%
  ggpairs(progress = FALSE,
        upper = list(continuous = wrap("cor", size = 9)))
# plot corr_humans and adjust font size
plot_corr_humans + 
  theme(axis.text  = element_text(size = 20),
        strip.text = element_text(size = 20))

# compute the correlation matrix and visualize it with corrplot
cor_matrix <- cor(human_country_as_row) %>%
  round(digits = 2)

# visualise the correlation matrix
# cl.cex = change font size of number-labels in colour-legend
# addCoef.col = add correlation value to circle, number.cex = adjust the font size of the number
plot_cor_matrix <- cor_matrix %>%
  corrplot(method="circle", addCoef.col = 0.5, type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 2,
           number.cex = 2, cl.cex = 2)

Many of the variables show some correlation (see above). From the correlation matrix, we can see that Life.Exp and the maternal mortality ratio (Mat.Mor) have a strong negative correlation (-0.86). Also the expected years of schooling (Edu.Exp) and Mat.Mor, Edu.Exp and the adolescent birth rate (Ado.Birth), and Life.Exp and Ado.Birth are negatively correlated, with -0.74, -0.70, and -0.73, respectively. On the other hand, Edu.Exp and Life.Exp, and Mat.Mor and Ado.Birth, are strongly positively correlated, with 0.79 and 0.76, respectively. According to the black-and-white plots from ggpairs, all these relationships are significant. Looking at the distributions of our variables, we can see that Edu.Exp is the only approximately normally distributed variable. Gross National Income per capita (GNI) and Mat.Mor are strongly right-skewed, and so are Ado.Birth and Parli.F, although to a lesser extent. Both Labo.FM and Life.Exp are left-skewed.

5.2 Principal component analysis (PCA)

Principal component analysis (PCA) helps us to summarise and visualise data with more than three dimensions. It is a statistical approach that can be used to analyse high-dimensional data and capture the most important information from it. In that way, linear combinations of the original predictors are turned into principal components that explain a large portion of the variation in a data set.
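Concretely, the principal components are the eigenvectors of the covariance matrix of the centred data. A base-R sketch on toy data showing that prcomp() and eigen() agree:

```r
set.seed(1)  # arbitrary seed for the toy data
X <- matrix(rnorm(40), ncol = 2)
# PCA via prcomp() (centres the data by default)
pca <- prcomp(X)
# PC variances from the eigendecomposition of the sample covariance matrix
ev <- eigen(cov(X))$values
all.equal(pca$sdev^2, ev)  # TRUE: the PC variances are the eigenvalues
```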

Sources: https://www.datacamp.com/tutorial/pca-analysis-r https://www.statology.org/principal-components-analysis-in-r/ http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/

5.2.1 PCA on non-standardised data

First, we perform PCA on the raw human data. The standard deviations show that the first principal component dwarfs all the others, and in the rotation matrix almost all loadings are very close to 0; only GNI loads strongly on the first principal component (PC1, loading ≈ 1). From the resulting biplot (see below), it likewise seems that only GNI has an influence on PC1: on the horizontal x-axis, which represents PC1, GNI stands alone on one side of the graph, while all other variables are clustered together.

# perform principal component analysis 
pca_human <- prcomp(human_country_as_row)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM    5.607472e-06 -0.0006713951 -3.412027e-05 -2.736326e-04  0.0022935252
## Labo.FM   -2.331945e-07  0.0002819357  5.302884e-04 -4.692578e-03 -0.0022190154
## Edu.Exp    9.562910e-05 -0.0075529759  1.427664e-02 -3.313505e-02 -0.1431180282
## Life.Exp   2.815823e-04 -0.0283150248  1.294971e-02 -6.752684e-02 -0.9865644425
## GNI        9.999832e-01  0.0057723054 -5.156742e-04  4.932889e-05  0.0001135863
## Mat.Mor   -5.655734e-03  0.9916320120  1.260302e-01 -6.100534e-03 -0.0266373214
## Ado.Birth -1.233961e-03  0.1255502723 -9.918113e-01  5.301595e-03 -0.0188618600
## Parli.F    5.526460e-05 -0.0032317269 -7.398331e-03 -9.971232e-01  0.0716401914
##                     PC6           PC7           PC8
## Edu2.FM    2.180183e-02 -6.998623e-01  7.139410e-01
## Labo.FM    3.264423e-02 -7.132267e-01 -7.001533e-01
## Edu.Exp    9.882477e-01  3.826887e-02  7.776451e-03
## Life.Exp  -1.453515e-01 -5.380452e-03  2.281723e-03
## GNI       -2.711698e-05  8.075191e-07 -1.176762e-06
## Mat.Mor    1.695203e-03 -1.355518e-04  8.371934e-04
## Ado.Birth  1.273198e-02  8.641234e-05 -1.707885e-04
## Parli.F   -2.309896e-02  2.642548e-03  2.680113e-03
# create and print out a summary of pca_human
s <- summary(pca_human)

# rounded percentages of variance captured by each PC (multiply the proportions by 100)
pca_pr <- round(100 * s$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# construct axis labels with the percentages of variance
paste0(names(pca_pr), " (", pca_pr, "%)")
## [1] "PC1 (100%)" "PC2 (0%)"   "PC3 (0%)"   "PC4 (0%)"  
## [5] "PC5 (0%)"   "PC6 (0%)"   "PC7 (0%)"   "PC8 (0%)"
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(2, 2), col = c("grey40", "deeppink2"), 
       xlab = "Principal component 1", ylab = "Principal component 2",
       sub = "The non-standardised, raw data varies largely in scales. This leads to two very different axis scales from -0.2 to 1.0 and -2000 to 100000, making it hard to summarise the whole data set.")

5.2.2 PCA on standardised data

When we standardise the data, the rotation matrix shows that several variables load on the first principal component (PC1) with similar magnitudes, indicating that PC1 captures variation shared by most of the variables. PC2 has its largest (absolute) loadings on Labo.FM and Parli.F (-0.72 and -0.65, respectively), indicating that this component mainly describes the variation in these two variables. When we print the percentages of variance, we can see that PC1 explains around 54% and PC2 around 16% of the total variance.

# scale human data 
human_std <- scale(human_country_as_row)
pca_human <- prcomp(human_std)
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM    0.35664370 -0.03796058 -0.24223089 -0.62678110 -0.5983585
## Labo.FM   -0.05457785 -0.72432726 -0.58428770 -0.06199424  0.2625067
## Edu.Exp    0.42766720 -0.13940571 -0.07340270  0.07020294  0.1659678
## Life.Exp   0.44372240  0.02530473  0.10991305  0.05834819  0.1628935
## GNI        0.35048295 -0.05060876 -0.20168779  0.72727675 -0.4950306
## Mat.Mor   -0.43697098 -0.14508727 -0.12522539  0.25170614 -0.1800657
## Ado.Birth -0.41126010 -0.07708468  0.01968243 -0.04986763 -0.4672068
## Parli.F    0.08438558 -0.65136866  0.72506309 -0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644 -0.16459453
## Labo.FM   -0.03500707 -0.22729927  0.07304568
## Edu.Exp   -0.38606919  0.77962966  0.05415984
## Life.Exp  -0.42242796 -0.43406432 -0.62737008
## GNI        0.11120305 -0.13711838  0.16961173
## Mat.Mor    0.17370039  0.35380306 -0.72193946
## Ado.Birth -0.76056557 -0.06897064  0.14335186
## Parli.F    0.13749772  0.00568387  0.02306476
# create and print out a summary of pca_human
s <- summary(pca_human)

# rounded percentages of variance captured by each PC (multiply the proportions by 100)
pca_pr <- round(100 * s$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# construct axis labels with the percentages of variance
paste0(names(pca_pr), " (", pca_pr, "%)")
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)"  "PC4 (7.6%)" 
## [5] "PC5 (5.5%)"  "PC6 (3.6%)"  "PC7 (2.6%)"  "PC8 (1.3%)"
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), 
       xlab = "Principal component 1: Poor countries with few women's rights (left) vs. rich countries with high education and life expectancy",
       ylab = "Principal component 2: Women in work and politics",
       sub = "The standardised, scaled data reduces the large variation in scales. This allows the principal components to better represent the variation in the original data set.") 

In the biplot above, we can see that Labo.FM and Parli.F are located very close to each other, and that Mat.Mor and Ado.Birth, as well as Edu.Exp, Edu2.FM, Life.Exp, and GNI, are also clustered. Looking at the different axes (PC1 = x-axis, PC2 = y-axis), we can again see that PC1 largely separates Ado.Birth and Mat.Mor as one group on the left from Life.Exp, Edu.Exp, Edu2.FM, and GNI as another group on the right. A third group is formed by Labo.FM and Parli.F; this is the only group whose variance is mainly captured by PC2.

As evident in the two biplots, the results from the raw and standardised data are very different. Looking back at the raw data, we can see that the different variables vary largely in scales. While Edu2.FM and Labo.FM range from 0.17 to 1.50 and 0.19 to 1.04, respectively, GNI ranges from 581 to 123124. The within-variable range and scale therefore differ widely. Since the variance in GNI is so large both in distance and number, it dominates the PCA when the variables are not scaled. In the scaled data set, however, it is possible to capture the differences in the variance of the variables since their values are much closer to one another. This allows us to better visualise the important information and groups in the data set.
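The scale effect can be reproduced with a two-variable toy example: a variable on a huge scale (standing in for GNI; the names and numbers are illustrative only) dominates the unscaled PCA, while scaling balances the contributions:

```r
set.seed(7)  # arbitrary seed
toy <- data.frame(small = rnorm(100, mean = 1,     sd = 0.2),
                  huge  = rnorm(100, mean = 20000, sd = 25000))
# unscaled: PC1 is driven almost entirely by 'huge'
prop_raw <- summary(prcomp(toy))$importance[2, 1]
# scaled: both variables contribute to PC1
prop_std <- summary(prcomp(toy, scale. = TRUE))$importance[2, 1]
round(c(unscaled = prop_raw, scaled = prop_std), 3)  # ~1 vs. ~0.5
```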

A proper interpretation of the PCA is difficult. Nevertheless, it can be assumed that the maternal mortality ratio (Mat.Mor) and adolescent birth rate (Ado.Birth) are clustered around countries such as Liberia, Congo, Chad, and the Central African Republic because women might not have access to education and/or contraceptives. As a result, women seem to get pregnant very early (between 10 and 14 years old) and risk dying during childbirth. The cluster of Life.Exp, Edu.Exp, Edu2.FM, and GNI might indicate that the higher the life expectancy and expected years of schooling, the higher the Gross National Income per capita. Since most countries situated on the right of the biplot are developed countries with a high GNI, this interpretation seems likely. Labo.FM and Parli.F are most likely clustered because women are much more likely to be involved in politics if they work.

5.3 Multiple Correspondence Analysis (MCA)

The Multiple Correspondence Analysis (MCA) is a generalisation of PCA used when the variables to be analysed are categorical instead of quantitative. The goal is to identify groups of individuals with similar response profiles, as well as the associations between the categories of the variables.

To show how MCA works, we use the tea data from the FactoMineR package. In this data set, 300 people (observations/rows) were asked 4 personal questions, 18 questions on how they drink tea, and 12 questions on how they perceive the products. To simplify our analysis, we choose six variables (6 columns) for our MCA, all of which are categorical. From the summary, we can see what kind of tea the respondents chose to drink, how they drank it, when they drank it, and where they got the tea from. The distribution is well shown in the histograms below.

Source: http://sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials
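The chunk that builds tea_time is not shown here; a plausible reconstruction (assuming the six columns listed in the str() output below are selected from the FactoMineR tea data) would be:

```r
# load the tea data shipped with FactoMineR
library(FactoMineR)
library(dplyr)
data(tea)
# keep the six categorical variables used in the MCA
tea_time <- dplyr::select(tea, Tea, How, how, sugar, where, lunch)
```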

## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 

There are several ways to assess the results of the MCA. If we want to know the proportion of variances retained by different dimensions, we can look at the so-called screeplot. There, we can see that the first dimension (Dim. 1) explains 15.2% of the variance in the data set, Dim. 2 14.2%, and so on. To explain all the variation in the data set, 11 dimensions are needed (see summary output from eig.val).

# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)

#install.packages("factoextra")
library(factoextra)
# get proportion of variances retained by different dimensions 
eig.val <- get_eigenvalue(mca)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1  0.27937118        15.238428                    15.23843
## Dim.2  0.26092645        14.232352                    29.47078
## Dim.3  0.21933575        11.963768                    41.43455
## Dim.4  0.18943794        10.332978                    51.76753
## Dim.5  0.17722310         9.666715                    61.43424
## Dim.6  0.15617745         8.518770                    69.95301
## Dim.7  0.14375727         7.841306                    77.79432
## Dim.8  0.14126310         7.705260                    85.49958
## Dim.9  0.11717818         6.391537                    91.89111
## Dim.10 0.08660997         4.724180                    96.61529
## Dim.11 0.06205294         3.384706                   100.00000
# screeplot
fviz_screeplot(mca, addlabels = TRUE) # visualise percentages of variances 

To see how well the variable categories are represented, we can use the squared cosine (cos2). Cos2 indicates how strongly the variable categories are correlated with a particular axis (or principal component) by measuring the degree of association. If a variable category is well represented by two dimensions, the sum of its cos2 values is close to one. For some variables, more than 2 dimensions are required to represent the data perfectly. Luckily, it is possible to show a colour gradient in a factor map, which greatly facilitates assessing the quality of representation (see below). Low cos2 values are shown in blue, mid values in yellow, and high cos2 values in orange-red, respectively. From our plot, we can see that the variable categories are generally not very well represented. The location where the tea was bought seems to be best represented, but all other variables show low cos2 values.

# extract results for individuals (get_mca_var() would give the variable categories)
ind <- get_mca_ind(mca)
# cos2: quality of the first individuals on the factor map
head(ind$cos2)
##        Dim 1     Dim 2      Dim 3      Dim 4        Dim 5
## 1 0.08622131 0.1047019 0.10430844 0.12973506 2.851705e-06
## 2 0.03646577 0.0120024 0.31368575 0.05599092 4.347624e-01
## 3 0.23148162 0.1534390 0.06910327 0.03886304 1.205832e-01
## 4 0.46021410 0.1659588 0.07321306 0.04045475 1.540730e-01
## 5 0.23148162 0.1534390 0.06910327 0.03886304 1.205832e-01
## 6 0.23148162 0.1534390 0.06910327 0.03886304 1.205832e-01
# plot cos2 
fviz_mca_var(mca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # Avoid text overlapping
             ggtheme = theme_minimal())

# Contributions to the principal components
# Contributions of rows to dimension 1
fviz_contrib(mca, choice = "var", axes = 1, top = 15)

# Contributions of rows to dimension 2
fviz_contrib(mca, choice = "var", axes = 2, top = 15)

# Total contribution to dimension 1 and 2
fviz_contrib(mca, choice = "var", axes = 1:2, top = 15)

# The most important (i.e., most contributing) variable categories can be highlighted on the scatter plot as follows:
fviz_mca_var(mca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # avoid text overlapping (slow)
             ggtheme = theme_minimal()
             )

Using the bar plots above, we can evaluate the contributions of each variable category to dimensions 1 and 2, individually and combined. In Dim 1, tea shop, unpackaged, tea bag, and chain store contribute the most. In Dim 2, chain store+tea shop, tea bag+unpackaged, tea shop, unpackaged, other, and green are above the expected average value if the contributions were uniform (red dashed line). The combined plot shows that tea shop, unpackaged, and chain store+tea shop are the most important contributors, which is again shown in the coloured scatter plot.

# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

Finally, we can visualise the MCA factor map (see above). The different colours show the six variable groups we chose in the beginning. Again, tea shop and unpackaged somewhat stand alone. Along Dim 1, we could interpret that the more “sophisticated” tea drinkers greatly enjoy good quality tea and would rather buy unpackaged tea in a tea shop, whereas “modest” tea drinkers would buy tea bags in a chain store. Along Dim 2, there are no great differences between tea bags and unpackaged tea, or where the tea is bought. Here we could suggest that people who drink Earl Grey might be more likely to add milk and sugar to their tea.


6. Analysis of longitudinal data

This file describes the work and results of the sixth week a. k. a. “Analysis of longitudinal data” of the IODS2023 course.

6.1 Simple analysis method on RATS data: Graphical and numerical overview

Many studies in the behavioral sciences involve several measurements or observations of the response variable of interest on each subject in the study. For example, the response variable may be measured under a number of different experimental conditions or on a number of different occasions over time; such data are labelled repeated measures or longitudinal data. Longitudinal data pose problems for analysis because the repeated measurements on each subject are very likely to be correlated rather than independent.

Graphical displays of data are almost always useful for exposing patterns in the data, particularly when these are unexpected; this might be of great help in suggesting which class of models might be most sensibly applied in the later more formal analysis. It is important to note that the simple methods should be used only in the initial stage of dealing with the data; more appropriate methods will be discussed in 6.2.

Source: Multivariate Analysis for the Behavioral Sciences, Second Edition (2019), special version for IODS course. Note that the text is not paraphrased.
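To make the correlation point above concrete, here is a small sketch with simulated data (the toy subjects, occasions, and numbers are made up for illustration): when each subject carries a persistent baseline level, measurements taken on the same subjects at different occasions are strongly correlated.

```r
library(tidyr)
set.seed(1)
# toy longitudinal data: 8 subjects measured on 3 occasions
toy <- expand.grid(ID = 1:8, Time = c(1, 8, 15))
# persistent subject-level baseline plus a common trend and noise
baseline <- rnorm(8, mean = 300, sd = 40)
toy$Weight <- baseline[toy$ID] + 2 * toy$Time + rnorm(nrow(toy), sd = 5)
# wide format: one column per measurement occasion
toy_wide <- pivot_wider(toy, names_from = Time, values_from = Weight,
                        names_prefix = "t")
# repeated measures on the same subjects are far from independent
cor(toy_wide$t1, toy_wide$t8)
```

Because the subject-level variation (sd 40) dwarfs the occasion-level noise (sd 5), the correlation between occasions is close to one, which is exactly the dependence the later mixed models account for.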

6.1.1 Scatter plot

We perform the first simple analysis on the RATS data set, in which groups of rats were put on different diets to see if the growth profiles differ between the groups. Each rat’s body weight (in grams) was recorded approximately every week for a period of 9 weeks, except for week seven, in which two measurements were taken. First, we create a plot where we separate the three different rat groups, showing their weight in grams (y-axis) over the 9-week period of their diet (see below).

# load necessary libraries
library(ggplot2)
library(dplyr)
# load in data set
RATS <- read.csv(file = "/home/ntriches/github_iods2023/IODS23/data/rats.csv", header = TRUE)
str(RATS) # ID and Group are not factors
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
# RATS: factor Id and group 
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
str(RATS)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
summary(RATS)
##        ID      Group       WD                Weight           Time      
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110
# plot 
ggplot(RATS, aes(x = Time, y = Weight, group = ID)) +
  geom_line(aes(linetype = Group, colour = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)") +
  theme(legend.position = "top")

We can see that rats in group 1 (8 rats, solid red lines) were the lightest group, with low within-group variation. Rats in group 2 (4 rats, dashed green lines) were heavier and show a massive outlier (top line on the graph). Overall, rats in group 3 (4 rats, dashed blue lines) were the heaviest. We can also see that the weight of all rats slightly increased over time, and that the large individual differences and variability appear to decrease with time. Another important thing we can see from the plot above is that rats that were heavier at the start of the diet tend to be heavier at the end of the diet, too. This phenomenon is generally referred to as tracking. To see it more clearly, we can standardise the values at each observation occasion, i.e., subtract the relevant occasion mean from the original observation and then divide by the corresponding occasion standard deviation:

\[standardised(x) = \frac{x - mean(x)}{sd(x)}\]

library(tidyr)
# standardise the variable weight (grams)
RATS <- RATS %>%
  group_by(Time) %>%
  mutate(StdWeight= (Weight - mean(Weight))/sd(Weight)) %>%
  ungroup()
# plot again with the standardised weights
ggplot(RATS, aes(x = Time, y = StdWeight, group = ID)) +
  geom_line(aes(linetype = Group, colour = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Standardised Weight") +
  theme(legend.position = "top")

Standardising the variables removes the effect of differing scales, which can make patterns such as tracking easier to see. In this case, however, the overall picture does not change and no additional information is revealed.

6.1.2 Summary graphs

With large numbers of observations, graphical displays of individual response profiles do not show us very much. Instead, we can produce graphs showing the average (mean) profile for each treatment group, along with some indication of the variation of the observations at each time point, in this case the standard error of the mean:

\[se = \frac{sd(x)}{\sqrt{n}}\]

# Summary data with mean and standard error of weight by group and time (days) 
# number of ID per group 
RATS_summary <- RATS %>%
  group_by(Group, Time) %>%
    summarise(mean = mean(Weight, na.rm = TRUE),
              n    = n(),
              se   = sd(Weight)/sqrt(n)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Plot the mean profiles
library(ggplot2)
ggplot(RATS_summary, aes(x = Time, y = mean, linetype = Group, shape = Group, colour = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.5)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

We can now see that there is some overlap in the mean profiles of group 2 and 3 (see above). To further test the differences between the groups, we can do side-by-side box plots of the observations at each time point (see below).

# Create a summary data by group and ID with mean as the summary variable 
RATS_summary_groupID <- RATS%>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATS_summary_groupID)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
# Draw a boxplot of the mean versus treatment
library(ggplot2)
ggplot(RATS_summary_groupID, aes(x = Group, y = mean, colour = Group)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight)")

# remove outliers in group 1 and 2
RATS_summary_groupID_noOutliers <- RATS%>%
  filter(!Weight < 250,
         !Weight > 550) %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# plot without 2 extreme outliers
ggplot(RATS_summary_groupID_noOutliers, aes(x = Group, y = mean, colour = Group)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight)")

From the plots above, we can see that rats in group 1 are considerably lighter than those in groups 2 and 3. Groups 2 and 3 are much closer, with group 2 showing the highest within-group variation of all groups. Each group has one outlier: in group 1 there is a very light rat, whereas in group 2 a relatively heavy rat is shown as an outlier. Group 3 also has an outlier, which lies within the variation of group 2. If we remove the two more extreme outliers from groups 1 and 2, the within-group variation in group 2 is greatly reduced, and the differences between the groups become larger. Nevertheless, because we can assume that the outliers are real data, I will not remove them from the data set.

6.1.3 Linear model / ANOVA

To test the differences between the groups in a more formal way, we could use ANOVA to estimate how a quantitative dependent variable (the mean of Weight) changes according to the levels of one or more categorical independent variables (Group). ANOVA then tests whether there is a difference in the group means at each level of the independent variable. Alternatively, we can use linear models, which also give us information on the intercept and interactions; this is what we will do. A t-test cannot be used since we have more than two groups.

Source for ANOVA: https://www.scribbr.com/statistics/anova-in-r/
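As a sketch of the ANOVA approach mentioned above, a one-way ANOVA on per-rat mean weights would look like this. The group sizes (8, 4, 4) match the RATS data, but the means and standard deviations below are illustrative values only, roughly read off the boxplots, so the exact output is not meaningful.

```r
# One-way ANOVA sketch on per-rat mean weights (toy numbers).
set.seed(2)
toy_means <- data.frame(
  Group = factor(rep(1:3, times = c(8, 4, 4))),
  mean_weight = c(rnorm(8, 265, 12),   # group 1: light rats
                  rnorm(4, 460, 25),   # group 2
                  rnorm(4, 530, 15))   # group 3: heaviest
)
rats_aov <- aov(mean_weight ~ Group, data = toy_means)
summary(rats_aov)  # F-test for a difference between the group means
```

With group means this far apart the F-test is clearly significant, which is the formal counterpart of what the boxplots already suggest.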

What we saw in the plots above is confirmed by the regression model: groups 2 and 3 are significantly different from group 1 (p < 0.001), and there is a significant positive relationship with Time (p < 0.001). When we continue by adding a random intercept, then a random intercept and slope, and finally a random intercept, slope, and interaction, each model represents the variation in the data better; we can see this from the decreasing AIC and BIC values and the significant differences between the models. As a result, RATS_intercept_slope_interaction_model can be considered the most accurate model. Looking at the marginal R2, it explains 93% of the variation in the data. Beyond what the simple regression model already told us, we can see that the interaction between Time and Group is clearly significant for group 2 (p < 0.001) but only marginal for group 3 (p = 0.052).

library(lme4)
# install.packages("lmerTest")
library(lmerTest)
# create a regression model RATS_reg
RATS_regression_model <- lm(Weight ~ Time + Group, data = RATS)
# summary
summary(RATS_regression_model)
## 
## Call:
## lm(formula = Weight ~ Time + Group, data = RATS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.643 -24.017   0.697  10.837 125.459 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 244.0689     5.7725  42.281  < 2e-16 ***
## Time          0.5857     0.1331   4.402 1.88e-05 ***
## Group2      220.9886     6.3402  34.855  < 2e-16 ***
## Group3      262.0795     6.3402  41.336  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.34 on 172 degrees of freedom
## Multiple R-squared:  0.9283, Adjusted R-squared:  0.9271 
## F-statistic: 742.6 on 3 and 172 DF,  p-value: < 2.2e-16
# create a random intercept model
RATS_intercept_model <- lmer(Weight ~ Time + Group + (1 | ID), data = RATS, REML = FALSE)
# create a random intercept and random slope model
RATS_intercept_slope_model <- lmer(Weight ~ Time + Group + (Time | ID), data = RATS, REML = FALSE)
# create a random intercept and random slope model with interaction 
RATS_intercept_slope_interaction_model <- lmer(Weight ~ Time * Group + (Time | ID), data = RATS, REML = FALSE)

# perform an ANOVA test on the models
anova(RATS_intercept_model, RATS_intercept_slope_model)
## Data: RATS
## Models:
## RATS_intercept_model: Weight ~ Time + Group + (1 | ID)
## RATS_intercept_slope_model: Weight ~ Time + Group + (Time | ID)
##                            npar    AIC    BIC  logLik deviance  Chisq Df
## RATS_intercept_model          6 1333.2 1352.2 -660.58   1321.2          
## RATS_intercept_slope_model    8 1194.2 1219.6 -589.11   1178.2 142.94  2
##                            Pr(>Chisq)    
## RATS_intercept_model                     
## RATS_intercept_slope_model  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(RATS_intercept_slope_model, RATS_intercept_slope_interaction_model)
## Data: RATS
## Models:
## RATS_intercept_slope_model: Weight ~ Time + Group + (Time | ID)
## RATS_intercept_slope_interaction_model: Weight ~ Time * Group + (Time | ID)
##                                        npar    AIC    BIC  logLik deviance
## RATS_intercept_slope_model                8 1194.2 1219.6 -589.11   1178.2
## RATS_intercept_slope_interaction_model   10 1185.9 1217.6 -582.93   1165.9
##                                         Chisq Df Pr(>Chisq)   
## RATS_intercept_slope_model                                    
## RATS_intercept_slope_interaction_model 12.361  2    0.00207 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# install.packages("performance")
library(performance)
# show the summary of the last model
summary(RATS_intercept_slope_interaction_model)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: Weight ~ Time * Group + (Time | ID)
##    Data: RATS
## 
##      AIC      BIC   logLik deviance df.resid 
##   1185.9   1217.6   -582.9   1165.9      166 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2667 -0.4249  0.0726  0.6034  2.7511 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr 
##  ID       (Intercept) 1.106e+03 33.2534       
##           Time        4.925e-02  0.2219  -0.15
##  Residual             1.975e+01  4.4439       
## Number of obs: 176, groups:  ID, 16
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept) 251.65165   11.79473  16.03276  21.336 3.38e-13 ***
## Time          0.35964    0.08215  16.00100   4.378 0.000468 ***
## Group2      200.66549   20.42907  16.03276   9.823 3.46e-08 ***
## Group3      252.07168   20.42907  16.03276  12.339 1.34e-09 ***
## Time:Group2   0.60584    0.14229  16.00100   4.258 0.000601 ***
## Time:Group3   0.29834    0.14229  16.00100   2.097 0.052269 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Time   Group2 Group3 Tm:Gr2
## Time        -0.160                            
## Group2      -0.577  0.092                     
## Group3      -0.577  0.092  0.333              
## Time:Group2  0.092 -0.577 -0.160 -0.053       
## Time:Group3  0.092 -0.577 -0.053 -0.160  0.333
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00504145 (tol = 0.002, component 1)
# show the r2 of the model
r2(RATS_intercept_slope_interaction_model)
## # R2 for Mixed Models
## 
##   Conditional R2: 0.999
##      Marginal R2: 0.930

6.2 Linear Mixed Effects Models on BPRS data

Following up on the models used above (6.1.3), we will use linear mixed effects models on BPRS data. The BPRS (brief psychiatric rating scale) data gives us information on 40 males in two different treatment groups, who were rated on the BPRS before treatment (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations, and grandiosity; each of these is rated from one (not present) to seven (extremely severe).

# load in data sets
BPRS <- read.csv(file = "/home/ntriches/github_iods2023/IODS23/data/bprs.csv", header = TRUE)
str(BPRS) # treatment and subject are not factors
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
# BPRS: factor treatment & subject
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
str(BPRS)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
head(BPRS)
##   treatment subject weeks bprs week
## 1         1       1 week0   42    0
## 2         1       2 week0   58    0
## 3         1       3 week0   54    0
## 4         1       4 week0   55    0
## 5         1       5 week0   72    0
## 6         1       6 week0   48    0
summary(BPRS)
##  treatment    subject       weeks                bprs            week  
##  1:180     1      : 18   Length:360         Min.   :18.00   Min.   :0  
##  2:180     2      : 18   Class :character   1st Qu.:27.00   1st Qu.:2  
##            3      : 18   Mode  :character   Median :35.00   Median :4  
##            4      : 18                      Mean   :37.66   Mean   :4  
##            5      : 18                      3rd Qu.:43.00   3rd Qu.:6  
##            6      : 18                      Max.   :95.00   Max.   :8  
##            (Other):252
# plot
ggplot(BPRS, aes(x = week, y = bprs, linetype = subject, colour = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRS$bprs), max(BPRS$bprs)))

From the plot above, we can see that the BPRS score of almost all men who participated decreased over the eight weeks of the study. Men who had higher values of BPRS in the beginning tend to have higher values throughout the study. The substantial individual differences and variability appear to decrease with time.

6.2.1 Multiple regression

Ignoring the repeated-measures structure of the data, we will fit a multiple linear regression model with bprs as response and week and treatment as explanatory variables.

# create a regression model 
BPRS_regression_model <- lm(bprs ~ week + treatment, data = BPRS)

# print out a summary of the model
summary(BPRS_regression_model)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

From the output above, we can see that men in treatment group 2 do not show significantly different results from men in treatment group 1. However, there is a negative association with week (estimate: -2.27, p < 0.001), showing that the BPRS scores of the participants generally decreased (i.e., improved) over the treatment period.

6.2.2 The Random Intercept Model

The previous model assumes independence of the repeated measures of bprs, an assumption that is highly unlikely to hold. So we will now move on to consider both more appropriate graphics and more appropriate models. We first fit a random intercept model with the same two explanatory variables, week and treatment. Fitting a random intercept model allows the linear regression fit for each male to differ in intercept from the other males.

# Create a random intercept model
BPRS_intercept_model <- lmerTest::lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
# Print the summary of the model
summary(BPRS_intercept_model)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     1.9090  37.2392  24.334   <2e-16 ***
## week         -2.2704     0.2084 340.0000 -10.896   <2e-16 ***
## treatment2    0.5722     1.0761 340.0000   0.532    0.595    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

6.2.3 Random Intercept and Random Slope Model

Now we can move on to fit the random intercept and random slope model to the bprs data. Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in both intercept and slope. This makes it possible to account for individual differences in the men’s bprs profiles as well as the effect of time (weeks).

# create a random intercept and random slope model
BPRS_intercept_slope_model <- lmerTest::lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
# print a summary of the model
summary(BPRS_intercept_slope_model)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     2.1052  22.6795  22.066  < 2e-16 ***
## week         -2.2704     0.2977  19.9991  -7.626 2.42e-07 ***
## treatment2    0.5722     1.0405 320.0005   0.550    0.583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# perform an ANOVA test on the two models
anova(BPRS_intercept_model, BPRS_intercept_slope_model)
## Data: BPRS
## Models:
## BPRS_intercept_model: bprs ~ week + treatment + (1 | subject)
## BPRS_intercept_slope_model: bprs ~ week + treatment + (week | subject)
##                            npar    AIC    BIC  logLik deviance  Chisq Df
## BPRS_intercept_model          5 2748.7 2768.1 -1369.4   2738.7          
## BPRS_intercept_slope_model    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2
##                            Pr(>Chisq)  
## BPRS_intercept_model                   
## BPRS_intercept_slope_model    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing the two models with ANOVA, we can see that the model including the random slope represents the data slightly better than the one without: its AIC is 2745.4 compared to 2748.7, and the difference between the models is significant (p = 0.026). However, the BIC of the random intercept model is lower (2768.1 vs 2772.6). We will keep trying to find the best model.

6.2.4 Random Intercept and Random Slope Model with interaction

We can fit a random intercept and slope model that allows for a week × treatment interaction.

# create a random intercept and random slope model with the interaction
BPRS_intercept_slope_interaction_model <- lmerTest::lmer(bprs ~ week * treatment + (week | subject), data = BPRS, REML = FALSE)
# print a summary of the model
summary(BPRS_intercept_slope_interaction_model)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)      47.8856     2.2521  29.6312  21.262  < 2e-16 ***
## week             -2.6283     0.3589  41.7201  -7.323 5.24e-09 ***
## treatment2       -2.2911     1.9090 319.9977  -1.200   0.2310    
## week:treatment2   0.7158     0.4010 319.9977   1.785   0.0752 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRS_intercept_slope_model, BPRS_intercept_slope_interaction_model) 
## Data: BPRS
## Models:
## BPRS_intercept_slope_model: bprs ~ week + treatment + (week | subject)
## BPRS_intercept_slope_interaction_model: bprs ~ week * treatment + (week | subject)
##                                        npar    AIC    BIC  logLik deviance
## BPRS_intercept_slope_model                7 2745.4 2772.6 -1365.7   2731.4
## BPRS_intercept_slope_interaction_model    8 2744.3 2775.4 -1364.1   2728.3
##                                         Chisq Df Pr(>Chisq)  
## BPRS_intercept_slope_model                                   
## BPRS_intercept_slope_interaction_model 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The random intercept and slope model with interaction has a slightly lower AIC of 2744.3, but the difference between the last two models is not significant (p > 0.05). Since the results of the different models do not differ greatly from one another, the random intercept and slope model appears to be the best choice: it takes into account that the measures of bprs are not independent, and it accounts for the effect of time (weeks).
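As a compact way to compare all three fitted BPRS models at once, we could tabulate their AIC and BIC directly. This sketch assumes the three lmer fits from the chunks above are still in the workspace; it produces the same numbers that appear in the anova() outputs.

```r
# Assumes BPRS_intercept_model, BPRS_intercept_slope_model and
# BPRS_intercept_slope_interaction_model have been fitted as above.
data.frame(
  model = c("random intercept", "+ random slope", "+ interaction"),
  AIC = c(AIC(BPRS_intercept_model),
          AIC(BPRS_intercept_slope_model),
          AIC(BPRS_intercept_slope_interaction_model)),
  BIC = c(BIC(BPRS_intercept_model),
          BIC(BPRS_intercept_slope_model),
          BIC(BPRS_intercept_slope_interaction_model))
)
```

Seeing AIC and BIC side by side makes the trade-off above explicit: AIC keeps decreasing as the models grow, while BIC penalises the extra parameters more heavily.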



  1. Doctoral Programme in Atmospheric Sciences at the University of Helsinki.↩︎